The project aims to build a pipeline for collecting M. tuberculosis NGS data and applying unsupervised learning methods to characterise the population structure in real time. The project has three stages:
- Several unsupervised learning techniques will be applied to a large database of publicly available isolate sequences to identify the best-performing methods for determining population structure. Techniques will be assessed on speed, scalability and accuracy.
- A backend for the TB-Profiler webserver will be developed to integrate frameworks from aims 1 and 2.
- A data protection impact assessment will be performed in compliance with GDPR. This will aim to characterise how the service interfaces with user data and to minimise potential data security risks. A security policy will be created to ensure that the project developers are knowledgeable about data privacy and security.

